2022-05-07

Introduction

Learning to use the tidyverse for data exploration and modelling.

Materials

NHANES glycohemoglobin data

National Health and Nutrition Examination Survey (NHANES) data on glycohemoglobin levels and diabetes mellitus (DM), obtained from https://hbiostat.org/data/.

Why this dataset?

  • Manageable size: 20 variables, 6795 observations
  • Wide spectrum of variables
  • Contains missing values to handle
  • Allows exploring correlations between DM diagnosis and the other variables

The data

Variable Description                     Units   Levels
seqn     Unique patient ID
sex      Gender                                  0, 1
age      Age                             Years   12 - 80
re       Race/ethnicity                          5 levels: White, Black, Mexican, Other Hispanic, Other
income   Family income level             $       14 levels from 0 - 100000
tx       On Insulin or Diabetes meds             0, 1
dx       Diagnosed with DM or pre-DM             0, 1
wt       Weight                          kg      28 - 239.4
ht       Height                          cm      123.3 - 202.7
bmi      Body-mass index                 kg/m^2  13.18 - 84.87
leg      Upper leg length                cm      20.4 - 50.6
arml     Upper arm length                cm      24.8 - 47
armc     Arm circumference               cm      16.8 - 61
waist    Waist circumference             cm      52 - 179
tri      Triceps skinfold thickness      mm      2.6 - 41.1
sub      Subscapular skinfold thickness  mm      3.8 - 40.4
gh       Glycohemoglobin                 %       4 - 16.4
albumin  Albumin                         g/dL    2.5 - 5.3
bun      Blood urea nitrogen             mg/dL   1 - 90
SCr      Serum Creatinine                mg/dL   0.14 - 15.66

Variable types

The dx variable does not differentiate between type I and type II DM.

Variables containing NAs

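A quick way to see which variables contain NAs is to count them per column. The sketch below uses a small made-up tibble (nhgh_toy is a hypothetical stand-in, not the real NHANES table) and assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the NHANES table; all values are made up for illustration
nhgh_toy <- tibble(
  seqn   = 1:5,
  income = c(20000, NA, 35000, NA, 75000),
  leg    = c(38.2, 41.0, NA, 36.5, 40.1),
  gh     = c(5.2, 6.1, 5.6, 7.4, 5.0)
)

# Count missing values per column, reshape to one row per variable,
# and keep only the variables that actually have missing values
na_counts <- nhgh_toy %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  filter(n_missing > 0)

print(na_counts)
```

Running the same pipeline on the full data set would list every variable that needs attention before modelling.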

Methods

Data journey

Data cleaning: Imputation of NAs

Variable Description                     Units   Levels
income   Family income level             $       14 levels from 0 - 100000

Missing income values were replaced with the mean of all non-NA income values.
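Mean imputation can be sketched in one mutate() call. The example assumes income is stored as a numeric column; the toy values are made up.

```r
library(dplyr)

# Toy income column with two missing values (made-up numbers)
toy <- tibble(income = c(10000, NA, 30000, NA, 50000))

# Replace each NA with the mean of the observed (non-NA) incomes
imputed <- toy %>%
  mutate(income = if_else(is.na(income), mean(income, na.rm = TRUE), income))

print(imputed)
```

Mean imputation keeps the column mean unchanged but shrinks its variance, which is worth keeping in mind when income is later used in PCA or clustering.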

Variable Description                     Units   Levels
leg      Upper leg length                cm      20.4 - 50.6
arml     Upper arm length                cm      24.8 - 47
armc     Arm circumference               cm      16.8 - 61
waist    Waist circumference             cm      52 - 179
tri      Triceps skinfold thickness      mm      2.6 - 41.1
sub      Subscapular skinfold thickness  mm      3.8 - 40.4

Here we implemented k-nearest-neighbours (KNN) imputation with K = 5 using tidyverse tools. We did not optimize K.
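One possible shape of such a KNN imputation is sketched below. It is not the implementation used for these slides: the helper knn_impute() is hypothetical, the choice of wt and ht as predictors is an assumption, and the predictors are assumed to be complete. Each missing value is replaced by the mean of the target over the K nearest rows, with Euclidean distance measured on standardised predictors.

```r
library(dplyr)

# Hypothetical helper: impute `target` from the k nearest rows,
# where distance is Euclidean on the standardised `predictors`
knn_impute <- function(data, target, predictors, k = 5) {
  # Standardise predictors so no single unit dominates the distance
  z <- data %>%
    select(all_of(predictors)) %>%
    mutate(across(everything(), ~ (.x - mean(.x)) / sd(.x))) %>%
    as.matrix()

  y <- data[[target]]
  observed <- which(!is.na(y))

  for (i in which(is.na(y))) {
    # Distance from row i to every row with an observed target value
    diffs <- sweep(z[observed, , drop = FALSE], 2, z[i, ])
    d <- sqrt(rowSums(diffs^2))
    nearest <- observed[order(d)[seq_len(min(k, length(observed)))]]
    y[i] <- mean(data[[target]][nearest])
  }

  data[[target]] <- y
  data
}

# Toy data (made-up values): the last row is missing its leg length
toy <- tibble(
  wt  = c(60, 62, 64, 66, 68, 90, 61),
  ht  = c(165, 166, 168, 170, 171, 185, 166),
  leg = c(38, 38.5, 39, 39.5, 40, 45, NA)
)

imputed <- knn_impute(toy, target = "leg", predictors = c("wt", "ht"), k = 5)
print(imputed)
```

Because K is fixed rather than tuned, a sensible follow-up would be to mask some known values and compare the imputation error across several values of K.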

Data cleaning: Removal of outliers

Biochemical variables have more outliers

Variable Description                     Units   Levels
SCr      Serum Creatinine                mg/dL   0.14 - 15.66

The normal range is 0.6 - 1.3 mg/dL; values of 5 mg/dL or more indicate severe kidney impairment. We removed all values above 5 mg/dL (17 values in total).
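The cutoff can be applied with a single filter(). The sketch below drops whole rows, which is an assumption, since the slides do not say whether rows were dropped or the values set to NA. Note that filter() silently discards rows where the condition is NA, so missing SCr values are kept explicitly here.

```r
library(dplyr)

# Toy data (made-up values) with two creatinine readings above the cutoff
toy <- tibble(seqn = 1:6, SCr = c(0.8, 1.1, 5.0, 7.3, NA, 15.66))

# Keep rows at or below 5 mg/dL, and keep rows where SCr is missing
cleaned <- toy %>%
  filter(is.na(SCr) | SCr <= 5)

print(cleaned)
```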

Results & Discussion

Principal Component Analysis

Investigating patterns in relation to diagnosis of diabetes mellitus
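A minimal sketch of how such a PCA could be set up with prcomp() from base R. The toy data are simulated, and the choice of bmi, waist and gh as inputs is an assumption made only for illustration. The diagnosis column is left out of the decomposition and attached to the scores afterwards, so it can be used to colour a PC1 vs PC2 plot.

```r
library(dplyr)

set.seed(1)

# Simulated toy data: 20 undiagnosed (dx = 0) and 20 diagnosed (dx = 1)
toy <- tibble(
  dx    = rep(c(0, 1), each = 20),
  bmi   = rnorm(40, mean = 25 + 5 * dx, sd = 3),
  waist = rnorm(40, mean = 85 + 15 * dx, sd = 8),
  gh    = rnorm(40, mean = 5.3 + 1.5 * dx, sd = 0.5)
)

# PCA on the centred and scaled numeric variables, excluding the label
pca <- toy %>%
  select(-dx) %>%
  prcomp(center = TRUE, scale. = TRUE)

# Scores per observation, with the diagnosis added back for plotting
scores <- as_tibble(pca$x) %>%
  mutate(dx = factor(toy$dx))

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(var_explained)
```

Scaling matters here because the variables are measured in different units (kg/m^2, cm, %); without it, the variable with the largest numeric spread would dominate the first component.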

Principal Component Analysis

Investigating patterns in relation to BMI

K-means clustering

Identify relevant number of clusters
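One common heuristic for choosing the number of clusters is the elbow method: fit k-means for a range of k and look for the point where the total within-cluster sum of squares stops dropping sharply. The slides do not state which criterion was used, so the sketch below is only an illustration on simulated data, assuming dplyr and purrr are installed.

```r
library(dplyr)
library(purrr)

set.seed(42)

# Simulated toy data with two well-separated groups
toy <- tibble(
  age = c(rnorm(30, 25, 3), rnorm(30, 60, 3)),
  bmi = c(rnorm(30, 22, 2), rnorm(30, 31, 2))
)

# Scale so that both variables contribute equally to the distances
scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1..6
elbow <- tibble(k = 1:6) %>%
  mutate(tot_withinss = map_dbl(
    k, ~ kmeans(scaled, centers = .x, nstart = 20)$tot.withinss
  ))

print(elbow)
```

Plotting tot_withinss against k (for example with ggplot2) makes the bend easier to spot than reading the numbers directly.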

K-means clustering

Clusters between age and all other variables

Conclusion